library(ggrepel)
library(patchwork)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 11))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

if (!require("pacman")) install.packages("pacman")
# use this line for installing/loading
pacman::p_load(tidyverse, glue, scales, ggthemes, here)
devtools::install_github("tidyverse/dsbox")
1 - Road traffic accidents in Edinburgh
# Read in data from accidents file
accidents <- read_csv(here("data", "accidents.csv"), show_col_types = FALSE)

# Create a new column - wrangle the data
accidents_wrangle <- accidents |>
  mutate(
    week = case_when(
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend",
      TRUE ~ "Weekday"
    ),
    week = fct_relevel(week, "Weekday", "Weekend")
  )

# Create the plot
ggplot(accidents_wrangle, aes(x = time, fill = severity, group = severity)) +
  geom_density(color = "black", alpha = 0.5) +
  scale_fill_manual(values = c("#AA93B0", "#9ECAC8", "#FEF39F")) +
  labs(
    x = "Time of day",
    y = "Density",
    title = "Number of accidents throughout the day",
    subtitle = "By day of the week and severity",
    fill = "Severity"
  ) +
  facet_wrap(~ week, nrow = 2) +
  theme_minimal(base_size = 11)
Description TODO
2 - NYC marathon winners
2a
# Read in data from nyc marathon file
marathon <- read_csv(here("data", "nyc_marathon.csv"), show_col_types = FALSE)

# Remove NA in time hrs column
marathon <- marathon %>%
  filter(!is.na(time_hrs))

# Create the histogram plot
ggplot(marathon, aes(x = time_hrs)) +
  geom_histogram(binwidth = 0.1) +
  labs(
    title = "Histogram of all runners in data set",
    x = "Time in hours",
    y = "Count"
  ) +
  theme_minimal(base_size = 11)
# Create the box plot
ggplot(marathon, aes(x = time_hrs)) +
  geom_boxplot(outlier.size = 2) +
  labs(
    title = "Box plot of all runners in data set",
    x = "Time in hours"
  ) +
  theme_minimal(base_size = 11)
The histogram makes it easier to see how many runners finished within each time range, and it shows the overall shape of the distribution of finish times across the dataset. The box plot does not show counts, but it makes the median finish time, the spread of the middle half of the data, and any outliers easy to read off.
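To make the comparison concrete, here is a minimal sketch of what each plot type summarizes. The finish times are hypothetical values chosen for illustration and are not taken from nyc_marathon.csv; the bin width of 0.1 hours matches the histogram above.

```r
# Hypothetical finish times in hours (illustrative values only)
time_hrs <- c(2.08, 2.12, 2.14, 2.17, 2.21, 2.23, 2.26, 2.31, 2.44, 3.12)

# Histogram view: how many runners fall in each 0.1-hour bin
bins <- cut(time_hrs, breaks = seq(2.0, 3.2, by = 0.1), right = FALSE)
bin_counts <- table(bins)
print(bin_counts[bin_counts > 0])

# Box plot view: five-number summary (min, lower hinge, median, upper hinge, max)
five_num <- fivenum(time_hrs)
print(five_num)

# Points a box plot would draw as outliers (beyond 1.5 times the hinge spread)
outliers <- boxplot.stats(time_hrs)$out
print(outliers)
```

The bin counts carry the "how many" information that the box plot drops, while the five-number summary and the flagged outliers are what the box plot draws directly.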
2b
# Create the histogram plot
ggplot(marathon, aes(x = time_hrs, fill = division)) +
  geom_histogram(binwidth = 0.1) +
  scale_fill_manual(
    values = c("Men" = "cornsilk4", "Women" = "deepskyblue3")
  ) +
  labs(
    title = "Histogram of all runners in data set",
    x = "Time in hours",
    y = "Count",
    fill = "Division"
  ) +
  theme_minimal(base_size = 11)
# Create the box plot
ggplot(marathon, aes(x = division, y = time_hrs, fill = division)) +
  geom_boxplot(outlier.size = 2) +
  scale_fill_manual(
    values = c("Men" = "cornsilk4", "Women" = "deepskyblue3")
  ) +
  labs(
    title = "Box plot of all runners in data set",
    x = "Division",
    y = "Time in hours",
    fill = "Division"
  ) +
  theme_minimal(base_size = 11)
Although not required, I also applied the division fill to the histogram as a practice exercise. The box plot, with fill mapped to division (men or women), gives a clear visual comparison of finish times by division. It shows that men generally finish before the 2.25-hour mark, while women typically finish close to or just after the 2.5-hour mark. The outliers in the men’s division extend slightly beyond 2.5 hours, whereas the women’s outliers appear after the 3-hour mark.
2c
TODO
2d
# Read in data from nyc marathon file and add the decade variables
marathon <- read_csv(here("data", "nyc_marathon.csv"), show_col_types = FALSE) |>
  mutate(
    decade_raced = (year %/% 10) * 10,
    decade_raced_cat = case_when(
      decade_raced <= 1970 ~ "1970 or before",
      decade_raced >= 2020 ~ "2020 or after",
      TRUE ~ as.character(decade_raced)
    )
  )

# Remove NA in time hrs column
marathon <- marathon %>%
  filter(!is.na(time_hrs))

# Count rows per decade to check the new variable
marathon |>
  count(decade_raced)
# Create the box plot
ggplot(marathon, aes(x = decade_raced_cat, y = time_hrs, fill = division)) +
  geom_boxplot(outlier.size = 2) +
  scale_fill_manual(
    values = c("Men" = "cornsilk4", "Women" = "deepskyblue3")
  ) +
  labs(
    title = "Marathon Finish Times by Decade and Division",
    x = "Decade",
    y = "Time in Hours",
    fill = "Division"
  ) +
  theme_minimal(base_size = 11)
This plot gives us a better idea of how men’s and women’s finish times have changed from decade to decade.
TODO
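The decade bucketing behind this plot can be checked on a handful of years. The sketch below reuses the same `%/%` and `case_when()` logic on a small hypothetical table (`races` and its years are made up for illustration, not rows from the data file).

```r
library(dplyr)

# Hypothetical race years, chosen to hit each branch of the case_when()
races <- tibble(year = c(1970, 1979, 1985, 1999, 2008, 2019, 2023))

races <- races |>
  mutate(
    # Integer division drops the last digit; multiplying by 10 restores the decade
    decade_raced = (year %/% 10) * 10,
    decade_raced_cat = case_when(
      decade_raced <= 1970 ~ "1970 or before",
      decade_raced >= 2020 ~ "2020 or after",
      TRUE ~ as.character(decade_raced)
    )
  )

print(races)
```

Note that with this rule every year from 1970 through 1979 has `decade_raced` equal to 1970, so the whole decade lands in the "1970 or before" category.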
3 - US counties
3a
The code displays a scatter plot of median education versus median household income. It also overlays a box plot of population in 2017 by smoking ban status on the same graph. Although the code runs without errors, it combines two unrelated visualizations on a single plot, which makes the results confusing and difficult to interpret.
In addition, both layers use the same color scheme, which makes the two plot types hard to tell apart. The y-axis can represent either income or population, but not both, so the visual is misleading. Lastly, the x-axis includes an ‘NA’ category from smoking_ban that does not represent valid data and should be removed or handled explicitly.
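One possible way to address these problems is to draw the two relationships as separate plots and to drop the missing smoking_ban values first. The sketch below uses a small made-up data frame (`toy_county`); the column names `median_hh_income` and `pop2017` are assumptions about how the county data is laid out, and the values are illustrative only.

```r
library(ggplot2)

# Hypothetical county-like rows (values invented for illustration)
toy_county <- data.frame(
  median_edu = c("hs_diploma", "some_college", "bachelors",
                 "some_college", "hs_diploma", "bachelors"),
  median_hh_income = c(41000, 52000, 78000, 49000, 38000, 83000),
  smoking_ban = c("none", "partial", NA, "none", "partial", "none"),
  pop2017 = c(12000, 85000, 430000, 56000, 9000, 610000)
)

# Plot 1: education vs. income, with a y-axis that only means income
p_income <- ggplot(toy_county, aes(x = median_edu, y = median_hh_income)) +
  geom_point() +
  labs(x = "Median education", y = "Median household income")

# Plot 2: population by smoking ban, after removing rows with no ban status
ban_known <- subset(toy_county, !is.na(smoking_ban))
p_pop <- ggplot(ban_known, aes(x = smoking_ban, y = pop2017)) +
  geom_boxplot() +
  labs(x = "Smoking ban", y = "Population (2017)")

print(p_income)
print(p_pop)
```

Keeping each plot on its own axes means the y-axis has a single meaning, and filtering before plotting removes the ‘NA’ category from the x-axis.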
3b
It is easier to compare poverty levels in the first plot, where the faceting variables are displayed vertically or in rows. From this plot, we observe that individuals who are older, more educated, and have lower poverty levels are more likely to be homeowners. This layout makes patterns easier to identify and shows clearer comparisons across groups.
In contrast, the version where the faceting variables are displayed horizontally, in columns, can be misleading. It gives the impression that individuals with only a high school diploma or some college experience are more likely to own homes at a younger age, which is not necessarily supported by the data. This highlights the importance of choosing the facet layout intentionally. Row-based faceting supports clearer comparisons when there are multiple categorical groupings.
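The two layouts differ only in which argument of `facet_grid()` receives the faceting variable. The sketch below shows both on a small made-up data frame; the values, and the choice to put poverty on the x-axis, are assumptions for illustration and not the plots from the exercise.

```r
library(ggplot2)

# Hypothetical county-like data (values invented for illustration)
toy <- data.frame(
  poverty       = c(22, 18, 15, 19, 14, 11, 12, 9, 7),
  homeownership = c(62, 70, 75, 58, 66, 80, 71, 77, 83),
  median_edu    = rep(c("hs_diploma", "some_college", "bachelors"), each = 3)
)

base_plot <- ggplot(toy, aes(x = poverty, y = homeownership)) +
  geom_point()

# Row layout: panels are stacked vertically, so x-axis values line up across panels
p_rows <- base_plot + facet_grid(rows = vars(median_edu))

# Column layout: panels sit side by side, so y-axis values line up across panels
p_cols <- base_plot + facet_grid(cols = vars(median_edu))

print(p_rows)
print(p_cols)
```

Whichever variable needs to be compared across groups should sit on the axis that the chosen layout keeps aligned.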
# Plot B
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_point(aes(x = homeownership, y = poverty)) +
  geom_smooth(aes(x = homeownership, y = poverty), color = "blue", se = FALSE) +
  labs(title = "Plot B") +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Plot C
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_point(aes(x = homeownership, y = poverty)) +
  geom_smooth(
    aes(x = homeownership, y = poverty, group = metro),
    color = "green", se = FALSE, show.legend = FALSE
  ) +
  labs(title = "Plot C") +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Plot D
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_smooth(
    aes(x = homeownership, y = poverty, group = metro),
    color = "blue", se = FALSE, show.legend = FALSE
  ) +
  geom_point(aes(x = homeownership, y = poverty)) +
  labs(title = "Plot D") +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Plot E
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  geom_smooth(
    aes(x = homeownership, y = poverty, linetype = metro),
    color = "blue", se = FALSE
  ) +
  labs(title = "Plot E") +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Plot F
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  geom_smooth(aes(x = homeownership, y = poverty, color = metro), se = FALSE) +
  labs(title = "Plot F") +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Plot G
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  geom_smooth(aes(x = homeownership, y = poverty), color = "blue", se = FALSE) +
  labs(title = "Plot G") +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Plot H
ggplot(county %>% filter(!is.na(median_edu))) +
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  labs(title = "Plot H") +
  theme_gray(base_size = 11)
4 - Credit Card Balances
4a
# Read in data from credit file
credit <- read_csv(here("data", "credit.csv"), show_col_types = FALSE)

# Create the plot
ggplot(credit) +
  geom_point(
    aes(x = income, y = balance, color = student, shape = student),
    show.legend = FALSE
  ) +
  geom_smooth(
    aes(x = income, y = balance, color = student),
    method = "lm", se = FALSE, show.legend = FALSE
  ) +
  scale_color_manual(values = c("Yes" = "#AA93B0", "No" = "#9ECAC8")) +
  scale_shape_manual(values = c("circle", "triangle")) +
  labs(
    x = "Income",
    y = "Credit card balance",
    caption = "https://stackoverflow.com/questions/26191833/add-panel-border-to-ggplot2"
  ) +
  scale_x_continuous(labels = label_dollar(suffix = "K")) +
  scale_y_continuous(labels = label_dollar()) +
  facet_grid(
    student ~ married,
    labeller = labeller(
      student = c("Yes" = "student: Yes", "No" = "student: No"),
      married = c("Yes" = "married: Yes", "No" = "married: No")
    )
  ) +
  theme_minimal(base_size = 14) +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5),
    strip.background = element_rect(fill = "gray90", color = "black")
  )
`geom_smooth()` using formula = 'y ~ x'
The general trend shows that individuals with higher incomes tend to carry higher credit card balances. Among lower income earners, there is greater variability. Some have low balances while others carry substantial debt.
Students who are not married typically have the lowest incomes, but still carry relatively high credit card debt. In contrast, married students tend to have slightly higher incomes and lower balances, suggesting more financial stability.
Married individuals who are not students appear to include the highest income earners. Some of them also hold the largest credit card balances and these are the most extreme outliers in the dataset. Meanwhile, unmarried non-students show a more even distribution in both income and credit card balance.
4b
The combination of marital and student status, along with income, provides insight into credit card balance trends. Married individuals who are not students generally have the highest incomes and include outliers with the largest credit card balances. In contrast, unmarried students tend to have the lowest incomes and display a wide range of credit card debt levels.
Unmarried students have the highest credit card utilization relative to their income. Similarly, married students with lower incomes also show high utilization rates. Overall, there is a trend where lower income individuals tend to use a larger portion of their available credit.
In contrast, while married non-students often carry higher credit card balances, their utilization rates remain relatively low, typically below 20% of their credit limit. This suggests that higher income earners tend to have greater available credit and are less reliant on it, while lower income individuals may have limited credit access and use a higher percentage of what is available to them.
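The utilization figures discussed above can be read as balance divided by credit limit. The sketch below computes that ratio on a few made-up rows; `toy_credit` is hypothetical, and the `limit` column name is an assumption about the credit data rather than something confirmed by the code shown here.

```r
library(dplyr)

# Hypothetical rows (values invented for illustration); `limit` is an assumed column
toy_credit <- tibble(
  income  = c(15, 22, 95, 140),          # in thousands of dollars
  limit   = c(2000, 3000, 9000, 12000),  # credit limit in dollars
  balance = c(900, 1200, 1500, 2000),    # credit card balance in dollars
  student = c("Yes", "Yes", "No", "No"),
  married = c("No", "Yes", "Yes", "No")
)

# Utilization: the share of the available credit limit that is currently used
toy_credit <- toy_credit |>
  mutate(utilization = balance / limit)

print(toy_credit)
```

In this toy table the two lower-income rows have smaller balances in dollar terms but use a larger share of their limit, which is the pattern described above.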